White Wine Quality by Jacob Kreisler

Introduction

We take a look at a dataset of nearly 5,000 white wines to find relationships and patterns from the ingredients in the wine. We will look at each ingredient individually, and then try to find out which of these variables or combination of variables contributes most to the overall quality of the wine. Cheers!

Univariate Plots Section

## [1] 4898   12
## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000
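The output above has the shape of what `dim()`, `str()`, and `summary()` print for a data frame. As a minimal sketch of those calls, the block below rebuilds the first ten rows of four columns from the `str()` listing instead of reading the full file. The commented-out file name is an assumption, not something stated in this report.

```r
# Hypothetical load step; the file name is an assumption.
# wine <- read.csv("wineQualityWhites.csv")

# First ten rows of four columns, copied from the str() output above.
wine_sample <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  pH             = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  quality        = rep(6L, 10)
)

dim(wine_sample)      # rows and columns
str(wine_sample)      # column types and leading values
summary(wine_sample)  # min, quartiles, mean, max per column
```

On the full dataset the same three calls would report 4898 rows and 12 columns, as shown above.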

Our dataset consists of eleven input variables and one output attribute, quality. There are almost 4,900 observations. Here are descriptions of the variables:

1 - fixed acidity: most acids involved with wine are fixed, or, nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Output variable (based on sensory data): 12 - quality (score between 0 and 10)

##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

By plotting the ‘quality’ variable, we can see the distribution of scores. The results show scores centered to the right of the middle of the range (5), with extreme cases at 3 and 9.

##     alcohol     
##  Min.   : 8.00  
##  1st Qu.: 9.50  
##  Median :10.40  
##  Mean   :10.51  
##  3rd Qu.:11.40  
##  Max.   :14.20

The amount of alcohol in the wines varies. By altering the bin size, we get a better look at how the data is distributed. It’s clear that the bulk of observations lie between 8.5% and 13% alcohol by volume, with the most frequent observation at around 9.5%.
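To illustrate how bin size changes the picture, here is a small sketch using base R's `hist()` on the first ten alcohol values from the `str()` listing. The break points are my own choice for this sample, not the ones used for the plot in this report.

```r
# First ten alcohol values, copied from the str() output at the top of this report.
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)

# plot = FALSE returns the bin counts without drawing anything,
# so the two binnings can be compared directly.
coarse <- hist(alcohol, breaks = seq(8, 12, by = 1),   plot = FALSE)
fine   <- hist(alcohol, breaks = seq(8, 12, by = 0.5), plot = FALSE)

coarse$counts  # four wide bins
fine$counts    # eight narrow bins over the same range
```

Both binnings hold the same ten observations; the narrower bins simply show where within each 1% band the values fall.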


Acids make wine sour and help keep it from going flat. They contribute greatly to its taste. Acidity is split between two categories: fixed and volatile:

##  fixed.acidity   
##  Min.   : 3.800  
##  1st Qu.: 6.300  
##  Median : 6.800  
##  Mean   : 6.855  
##  3rd Qu.: 7.300  
##  Max.   :14.200

##  volatile.acidity
##  Min.   :0.0800  
##  1st Qu.:0.2100  
##  Median :0.2600  
##  Mean   :0.2782  
##  3rd Qu.:0.3200  
##  Max.   :1.1000

The data seems to cluster around a smaller range than alcohol, but volatile acidity is positively skewed, with outliers that contain more acidity. I set the x-axis limits to omit these outliers to get a better sense of the data. Will these outliers give a better indication of acidity’s effect on the overall quality?

##   citric.acid    
##  Min.   :0.0000  
##  1st Qu.:0.2700  
##  Median :0.3200  
##  Mean   :0.3342  
##  3rd Qu.:0.3900  
##  Max.   :1.6600

Similar to the distribution of volatile acidity, the data is positively skewed with outliers that contain more citric acid. I limit these outliers to better visualize the data.

##  residual.sugar  
##  Min.   : 0.600  
##  1st Qu.: 1.700  
##  Median : 5.200  
##  Mean   : 6.391  
##  3rd Qu.: 9.900  
##  Max.   :65.800

The data here is very spread out, again positively skewed with several significant outliers. The observations are beginning to show that most wine variables lie within a relatively tight range, and the outliers are those wines whose creators decided to use MORE of an ingredient, not less. Will the risk pay off?

##    chlorides      
##  Min.   :0.00900  
##  1st Qu.:0.03600  
##  Median :0.04300  
##  Mean   :0.04577  
##  3rd Qu.:0.05000  
##  Max.   :0.34600

Chlorides are minerals dissolved in the wine. High chloride levels would give the wine a saltier taste.

##  free.sulfur.dioxide
##  Min.   :  2.00     
##  1st Qu.: 23.00     
##  Median : 34.00     
##  Mean   : 35.31     
##  3rd Qu.: 46.00     
##  Max.   :289.00

The distribution is approximately normal once the outliers are removed.

##  total.sulfur.dioxide
##  Min.   :  9.0       
##  1st Qu.:108.0       
##  Median :134.0       
##  Mean   :138.4       
##  3rd Qu.:167.0       
##  Max.   :440.0

Normal distribution here as well. Sulfur dioxide helps prevent microbial growth and oxidation in the wine.

##     density      
##  Min.   :0.9871  
##  1st Qu.:0.9917  
##  Median :0.9937  
##  Mean   :0.9940  
##  3rd Qu.:0.9961  
##  Max.   :1.0390

##    sulphates     
##  Min.   :0.2200  
##  1st Qu.:0.4100  
##  Median :0.4700  
##  Mean   :0.4898  
##  3rd Qu.:0.5500  
##  Max.   :1.0800

Density and Sulphates have normal distributions. Nothing really stands out from the data yet.

##        pH       
##  Min.   :2.720  
##  1st Qu.:3.090  
##  Median :3.180  
##  Mean   :3.188  
##  3rd Qu.:3.280  
##  Max.   :3.820

The pH is clustered around 3, which is about the same acidity as orange or grapefruit juice.

So if the pH indicates how acidic the wine is, is there a “sweet spot” that will indicate the overall quality of the wine? I create a subset of wines with quality ratings of 8 or 9 and take a look at the spread of pH values.

##        pH       
##  Min.   :2.940  
##  1st Qu.:3.127  
##  Median :3.230  
##  Mean   :3.221  
##  3rd Qu.:3.330  
##  Max.   :3.590

No, the distribution of pH values appears very similar to the overall spread. What about the lowest quality wines?

##        pH       
##  Min.   :2.830  
##  1st Qu.:3.060  
##  Median :3.160  
##  Mean   :3.183  
##  3rd Qu.:3.285  
##  Max.   :3.720

Again, the distribution of pH values for wines with quality ratings of 3 and 4 does not show any real connection to quality.
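The two quality-based subsets can be sketched as follows. The data frame here is hypothetical toy data made up for illustration (it is not taken from the wine dataset), and the exact subsetting calls used for this report are not shown, so `subset()` is just one reasonable way to do it.

```r
# Hypothetical toy data, NOT from the wine dataset; illustration only.
toy <- data.frame(
  pH      = c(3.10, 3.25, 3.00, 3.40, 3.15, 3.30),
  quality = c(3L, 8L, 4L, 9L, 6L, 8L)
)

# Highest-rated wines (quality 8 or 9) and lowest-rated wines (quality 3 or 4).
high_quality <- subset(toy, quality >= 8)
low_quality  <- subset(toy, quality <= 4)

# Compare the spread of pH in each group, as in the summaries above.
summary(high_quality$pH)
summary(low_quality$pH)
```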

So if pH does not dictate overall quality, what does?

Univariate Analysis

What is the structure of your dataset?

The dataset contains 4,898 white wines with 12 variables - fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality. The quality variable is a scale, numbered 0 through 10, that serves to represent how good the wine is overall. The lowest quality observed is 3, while the highest is 9. The median quality is 6.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the dataset is quality. My objective is to find out how the other 11 variables contribute to, or take away from, this final outcome.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I believe there will be more than one answer to this question, as quality comes from more than one variable, but I suspect that acidity and residual sugar will contribute most. I think these should contribute most to overall taste.

Did you create any new variables from existing variables in the dataset?

Yes, I created two subsets of the pH variable - one for wines with quality ratings of 8 and 9, and another for quality ratings of 3 and 4.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Most of the distributions contained outliers, generally in the positive direction, and so by adjusting the bin width and x-axis limits, I was able to get a better sense of the overall data. A good example of this is the residual sugar variable. The shape of the data completely transforms once the outliers are removed. Bin width was computed using the Freedman-Diaconis rule, where width = 2 * IQR * n^(-1/3).
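As a worked example of the Freedman-Diaconis rule, the sketch below plugs in the residual sugar quartiles from the summary output (1st Qu. 1.7, 3rd Qu. 9.9) and n = 4898. The function names are my own and are only for illustration.

```r
# Freedman-Diaconis bin width from an interquartile range and a sample size.
fd_bin_width <- function(iqr, n) 2 * iqr * n^(-1/3)

# Residual sugar: quartiles taken from the summary() output in this report.
sugar_width <- fd_bin_width(iqr = 9.9 - 1.7, n = 4898)
sugar_width  # a little under 1 gram/liter per bin

# The same rule applied directly to a numeric vector.
fd_from_vector <- function(x) 2 * IQR(x) * length(x)^(-1/3)
```

Because the quartiles printed by `summary()` are rounded, the width computed from the raw column may differ slightly from this figure.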

Bivariate Plots Section

##                 pH
## quality 0.09942725

The chart confirms that there is no real link between pH and quality, and the correlation coefficient of about 0.1 mirrors that conclusion.

##         residual.sugar
## quality    -0.09757683

The relationship between residual sugar and quality is just as weak as the one between pH and quality, with a correlation of about -0.1. Is there a relationship between pH and residual sugar?

##    residual.sugar
## pH     -0.1941335

We can see that there is no real connection between the two variables, as the data is spread out wildly on the plot. The correlation of about -0.2 only serves to confirm this.
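The one-cell tables above look like what `cor()` returns when it is given two single-column data frames. Here is a minimal sketch on the first ten rows from the `str()` listing; the exact call used for this report is an assumption, and ten rows will not reproduce the full-data coefficient.

```r
# First ten rows of two columns, copied from the str() output at the top of this report.
wine_sample <- data.frame(
  pH             = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
)

# Pearson correlation; single-bracket indexing keeps the column names,
# so the result is a labelled 1 x 1 matrix.
r <- cor(wine_sample["pH"], wine_sample["residual.sugar"])
r
```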

So what other variables could we test?

##            density
## alcohol -0.7801376

##                  density
## residual.sugar 0.8389665

So, we have two strong relationships here regarding density. We can see that there is a strong positive correlation between residual sugar and wine density, and a strong negative correlation between alcohol and wine density. Perhaps density has something to do with quality?

##            density
## quality -0.3071233

If we know that there is a relatively weak relationship between density and quality, and density is affected by residual sugar and alcohol, we may be able to make assumptions as to how those two variables affect overall quality. Do higher quality wines have more alcohol and less residual sugar?

I create a subset of “top shelf” wines that contain the highest 25% of alcohol and lowest 25% of residual sugar and compare that subset to the overall dataset…

##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000
##     quality     
##  Min.   :3.000  
##  1st Qu.:6.000  
##  Median :6.000  
##  Mean   :6.122  
##  3rd Qu.:7.000  
##  Max.   :9.000

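Both the charts and the statistics show a real difference: the mean quality rises from 5.88 to 6.12, and the first and third quartiles each move up by one point.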
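The “top shelf” filter can be sketched with `quantile()`. The block below uses the first ten rows from the `str()` listing; whether the cutoffs in this report were inclusive (`>=`, `<=`) is an assumption on my part.

```r
# First ten rows of three columns, copied from the str() output at the top of this report.
wine_sample <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  quality        = rep(6L, 10)
)

# Cutoffs: 75th percentile of alcohol, 25th percentile of residual sugar.
alcohol_cut <- quantile(wine_sample$alcohol, 0.75)
sugar_cut   <- quantile(wine_sample$residual.sugar, 0.25)

# Keep wines in the top quarter for alcohol AND the bottom quarter for sugar.
top_shelf <- subset(wine_sample,
                    alcohol >= alcohol_cut & residual.sugar <= sugar_cut)

summary(top_shelf$quality)
```

In this ten-row sample every wine has quality 6, so the summary is not informative here; on the full dataset it is the comparison of this summary against the overall one that matters.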

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I learned that the feature of interest, quality, is not affected as strongly by pH and residual sugar as I expected. In fact, there are no variables that strongly correlate to a higher or lower quality rating. We found that density, which is affected strongly by alcohol and residual sugar, may be the best indicator of overall quality.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes, I observed that density is strongly correlated to alcohol and residual sugar levels in the wines.

What was the strongest relationship you found?

The positive relationship between residual sugar and density. The correlation coefficient is 0.839.

Multivariate Plots Section

We can see pretty quickly that residual sugar and pH don’t affect the overall quality of wine as strongly as I thought they would. Residual sugar tends to be higher in lower quality wines, but the strength of the pattern is not overwhelming.

Once plotted, we can see a clear pattern emerge as the color of higher quality wines is more dense in the lower right hand side of the chart. This shows that as we filter those wines with higher alcohol levels, and lower residual sugar, we are left with higher quality wines.

Isolating wines with quality of 8 and 9, we get a better sense of where these wines lie in the overall distribution. This confirms the findings of the previous plot. We can see a higher density of observations in the lower right corner, especially for wines of quality 9.

Moving from the top left chart, showing quality 3, to the bottom right, showing quality 9, you can clearly see the data moving down and to the right as the quality increases.
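A numerical counterpart to reading the panels is to average alcohol and residual sugar within each quality level. The sketch below shows the `aggregate()` call on hypothetical toy data; the values are made up purely to show the mechanics and say nothing about the real wines.

```r
# Hypothetical toy data, NOT from the wine dataset; illustration only.
toy <- data.frame(
  quality        = c(3L, 3L, 6L, 6L, 9L, 9L),
  alcohol        = c(9.0, 9.4, 10.0, 10.6, 12.0, 12.4),
  residual.sugar = c(8, 10, 6, 4, 2, 3)
)

# Mean alcohol and mean residual sugar for each quality score.
by_quality <- aggregate(cbind(alcohol, residual.sugar) ~ quality,
                        data = toy, FUN = mean)
by_quality
```

Run on the full dataset, a table like this would put numbers on the “down and to the right” movement seen across the panels.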

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Most of the observations served to confirm what was discovered in the previous section. When we plot the relationship between residual sugar, alcohol, and quality, we can visualize the data moving toward the direction we predicted.

Were there any interesting or surprising interactions between features?

Not really. The quality of wine is hard to predict with one or two variables. Although we have found a trend, there are no strong predictor(s).

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

##    residual.sugar
## pH     -0.2876887

Description One

This is the chart that disproved my theory. By layering the distribution of the lowest quality wines over the highest quality wines, it’s clear that there is almost no difference. The values for both residual sugar and pH are dispersed about equally. There is no clear pattern that would distinguish a high quality wine from a low quality one. The correlation coefficient of -0.288 confirms this relationship.

Plot Two

##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000
##     quality     
##  Min.   :3.000  
##  1st Qu.:6.000  
##  Median :6.000  
##  Mean   :6.122  
##  3rd Qu.:7.000  
##  Max.   :9.000

Description Two

Looking at the subset chart on the right, we can see that the distribution has shifted to the right compared to the chart on the left. The peaks that were previously found on quality 5 and 6 have shifted to higher peaks on 7, 8, and 9. Comparing the means, we see an increase to 6.12 from 5.88.

Plot Three

Description Three

As you move toward the bottom right corner of the chart, the corner representing high alcohol and low residual sugar levels, you see a greater cluster of higher quality wines.


Reflection

In closing, there have been both challenges and successes in analyzing this dataset. The successes came from analyzing each variable and taking a look at how they were distributed. By changing bin size and limiting outliers, I believe I was able to get a good sense of the characteristics of each variable. The challenges came from finding connections between these variables and how those connections could serve to predict the quality of a wine. There may be missing variables in the dataset that play an important role in a wine’s overall quality. Maybe the quality variable is subjective, and different taste buds prefer different combinations of ingredients. In this dataset, there were no really strong connections between any one variable and quality, so I focused on finding connections between the other variables and then used those findings to circle back to the quality outcome. In many instances, plotting relationships and visualizing the data allowed me to brainstorm and find connections that I otherwise would not have found. Using a correlation matrix also helped me focus in on particular relationships in the data.

For further analysis in the future, perhaps it would be beneficial to analyze more variables in the wines to see if they have any connection to quality. Variables like vintage year, vineyard location, or even price may hold the answer to this question.

When I began this project, I had assumed that the data would reveal clear patterns in the variables that would make it possible to predict a high quality wine. After digging into the data, analyzing my findings, and realizing that I was wrong, it’s become very clear to me that these findings make a lot of sense. There is no clear formula for the “perfect wine.” So, perhaps there are generalities that we can take away from the data, and ranges that the wine’s ingredients should stay within, but I believe the real answer is that a great wine is all about balance - how the variables come together in such a way to produce something great…